Gittins and Whittle's Index

Bandit Problem

Markov Bandit Process

![[img-20240926210614018|400]]

Objective Function

$$\max_{\pi} J^{\pi}(\xi) = \lim_{T \to \infty} \mathbb{E}\left[\left.\sum_{t=0}^{T-1} a^{t}\, r_{i_t}\!\left(\xi_{i_t}(t)\right) \,\right|\, \xi(0) = \xi\right]$$

At time $t$, we know the state $\xi$, the transition probabilities, the discount factor $a$, and the reward function of each bandit.
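To make the setup concrete, here is a minimal sketch (not from the source) of a single Markov bandit process with a finite state space: a transition matrix `P`, a reward function `r`, and a shared discount factor `a`. The class name and interface are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (assumption, not from the source) of one Markov bandit process:
# a finite state space, a transition matrix P, a reward function r,
# and a shared discount factor a in (0, 1) used by the objective.
class MarkovBandit:
    def __init__(self, P, r, rng=None):
        self.P = np.asarray(P)   # P[x, y] = Prob(next state y | bandit played in state x)
        self.r = np.asarray(r)   # r[x]   = reward collected when played in state x
        self.rng = rng or np.random.default_rng()

    def step(self, x):
        """Play the bandit in state x: collect r[x] and move to a random next state.
        In the classical model, a bandit that is NOT played stays frozen in its state."""
        reward = self.r[x]
        x_next = self.rng.choice(len(self.r), p=self.P[x])
        return reward, x_next
```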

Examples

![[img-20240926212053726|500]]
![[img-20240926212109899|500]]

Gittins Index

Multi-Armed Bandit Problem (an open problem for almost 40 years)

Index Policy

An index policy is defined by a function $v_i(\xi_i)$, computed separately for each bandit $i$, such that, for every state $\xi$, the optimal policy continues the bandit

$$i_t = \arg\max_{i \in \{1, \dots, n\}} \left\{ v_i\!\left(\xi_i(t)\right) \right\}$$

Notice that $v_i(\xi_i)$ depends only on the parameters associated with a single bandit. But how should such a function be designed?

$$\max_{\pi} J^{\pi}(\xi) = \lim_{T \to \infty} \mathbb{E}\left[\left.\sum_{t=0}^{T-1} a^{t}\, r_{i_t}\!\left(\xi_{i_t}(t)\right) \,\right|\, \xi(0) = \xi\right]$$

Index Theorem

The optimal policy for this problem is an index policy.
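As an illustration of what an index policy does operationally, here is a short sketch (assuming the `MarkovBandit` class from the earlier sketch and a hypothetical `index_fn(i, x)` returning $v_i(x)$): at each time, play the bandit whose current state has the largest index; all other bandits stay frozen.

```python
def run_index_policy(bandits, index_fn, states, a, T):
    """Sketch of an index policy over T steps.
    bandits  : list of MarkovBandit objects (from the earlier sketch)
    index_fn : assumed callable, index_fn(i, x) -> v_i(x)
    states   : initial state of each bandit
    a        : discount factor in (0, 1)
    Returns the realized discounted reward."""
    total, discount = 0.0, 1.0
    states = list(states)
    for _ in range(T):
        # Play the bandit with the largest current index; the others stay frozen.
        i_t = max(range(len(bandits)), key=lambda i: index_fn(i, states[i]))
        reward, states[i_t] = bandits[i_t].step(states[i_t])
        total += discount * reward
        discount *= a
    return total
```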

Derivation of the Index – Single bandit with charge

$$J(\xi_i) = \max_{\pi} J^{\pi}(\xi_i) = \sup_{\tau > 0} \mathbb{E}\left[\left.\sum_{t=0}^{\tau-1} a^{t} \left[ r_i(\xi_i(t)) - \lambda \right] \,\right|\, \xi_i(0) = \xi_i\right]$$

The index is the fair charge: the largest $\lambda$ for which it is still worth playing at least once, i.e. the value of $\lambda$ at which continuing and stopping immediately are equally good. Setting the optimal value to zero and solving for $\lambda$ gives

$$v_i(\xi_i) = \sup_{\tau > 0} \frac{\mathbb{E}\left[\sum_{t=0}^{\tau-1} a^{t}\, r_i(\xi_i(t)) \,\middle|\, \xi_i(0) = \xi_i\right]}{\mathbb{E}\left[\sum_{t=0}^{\tau-1} a^{t} \,\middle|\, \xi_i(0) = \xi_i\right]}$$

Gittins Index

Going back to the Simple Family of Alternative Bandit Processes with $n$ bandits and no playing charge, the Gittins index associated with bandit $i$ in state $\xi_i$ is

$$v_i(\xi_i) = \sup_{\tau > 0} \frac{\mathbb{E}\left[\sum_{t=0}^{\tau-1} a^{t}\, r_i(\xi_i(t)) \,\middle|\, \xi_i(0) = \xi_i\right]}{\mathbb{E}\left[\sum_{t=0}^{\tau-1} a^{t} \,\middle|\, \xi_i(0) = \xi_i\right]}$$

where $\tau$ is a stopping time.

The numerator is the expected discounted **reward** accrued up to time $\tau$; the denominator is the expected discounted **time** up to time $\tau$.


“greatest per period rent that one would be willing to pay for ownership of the rewards arising from the bandit as it is continued for one or more periods.”

Because the index is the cost one would pay to just break even (the fair charge), the larger the charge one is willing to pay for a bandit, the higher the reward one can expect to obtain from it.
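One way to compute the index numerically follows directly from this fair-charge interpretation: for a finite-state bandit, $v_i(\xi_i)$ is the charge $\lambda$ at which the optimal value of the charged stopping problem at $\xi_i$ drops to zero. The sketch below (illustrative, not from the source) solves the stopping problem by value iteration and finds the fair charge by bisection, using the fact that the value is non-increasing in $\lambda$.

```python
import numpy as np

def stopping_value(P, r, a, lam, tol=1e-10, max_iter=10_000):
    """Value of the single-bandit problem with a per-play charge lam:
    V(x) = max(0, r(x) - lam + a * sum_y P[x, y] V(y)),
    solved by value iteration (a contraction for 0 < a < 1)."""
    V = np.zeros(len(r))
    for _ in range(max_iter):
        V_new = np.maximum(0.0, r - lam + a * P @ V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

def gittins_index(P, r, a, x, iters=60):
    """Gittins index of state x: the charge at which continuing from x stops
    being profitable, found by bisection. The index is a discounted average
    of rewards, so min(r) and max(r) bracket it."""
    lo, hi = float(np.min(r)), float(np.max(r))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if stopping_value(P, r, a, mid)[x] > 1e-12:
            lo = mid   # still worth playing at charge mid -> index is larger
        else:
            hi = mid   # optimal to stop immediately       -> index is smaller
    return 0.5 * (lo + hi)
```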

This proof is instructive because:

  1. it provides insight into why the Gittins index policy is optimal;
  2. it provides insight into why it is NOT optimal in the restless case.

More details: @GittinsIndexMultiarmed

Whittle Index

Restless Multi-Armed Bandit Problem

Three Optimization Problems

Original

$$\begin{aligned}
\text{maximize} \quad & \lim_{T \to \infty} \mathbb{E}\left[\sum_{t=0}^{T-1} a^{t} \sum_{i=1}^{n} r_i\!\left(\xi_i(t), u_i(t)\right)\right] \\
\text{s.t.} \quad & \sum_{i=1}^{n} u_i(t) = m, \quad \forall t \\
& u_i(t) \in \{0, 1\}, \quad \forall i
\end{aligned}$$

Relaxed

Problem with the relaxed activation constraint: instead of activating exactly $m$ bandits at every time, only the expected total discounted number of activations is constrained. Since the original constraint activates $m$ bandits per period, this total equals $\sum_{t=0}^{\infty} a^{t} m = m/(1-a)$.

$$\begin{aligned}
\text{maximize} \quad & \lim_{T \to \infty} \mathbb{E}\left[\sum_{t=0}^{T-1} a^{t} \sum_{i=1}^{n} r_i\!\left(\xi_i(t), u_i(t)\right)\right] \\
\text{s.t.} \quad & \mathbb{E}\left[\sum_{t=0}^{\infty} a^{t} \sum_{i=1}^{n} u_i(t)\right] = \frac{m}{1-a} \\
& u_i(t) \in \{0, 1\}, \quad \forall i
\end{aligned}$$

Lagrange

The Lagrange Dual Function is given by:

$$\begin{aligned}
L(\lambda) = \text{maximize} \quad & \lim_{T \to \infty} \mathbb{E}\left[\sum_{t=0}^{T-1} a^{t} \sum_{i=1}^{n} \left( r_i\!\left(\xi_i(t), u_i(t)\right) - \lambda\, u_i(t) \right)\right] + \lambda \frac{m}{1-a} \\
\text{s.t.} \quad & u_i(t) \in \{0, 1\}, \quad \forall i
\end{aligned}$$

Notice that we can decouple this problem across bandits and neglect the last term, which is a constant.
【Decoupled Problem】 (similar to the Gittins single-bandit problem with charge $\lambda$)

$$\begin{aligned}
\text{maximize} \quad & \lim_{T \to \infty} \mathbb{E}\left[\sum_{t=0}^{T-1} a^{t} \left( r_i\!\left(\xi_i(t), u_i(t)\right) - \lambda\, u_i(t) \right)\right] \\
\text{s.t.} \quad & u_i(t) \in \{0, 1\}
\end{aligned}$$
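The decoupled problem is a two-action MDP for a single bandit: rest (passive) or play (active) and pay the charge $\lambda$. A small sketch solving it by value iteration (illustrative; `P0`, `P1`, `r0`, `r1` are assumed passive/active transition matrices and reward vectors, not names from the source):

```python
import numpy as np

def decoupled_value(P0, P1, r0, r1, a, lam, tol=1e-10, max_iter=10_000):
    """Value iteration for the decoupled single-bandit problem with charge lam:
    V(x) = max( r0(x)       + a * sum_y P0[x, y] V(y),    # passive (rest)
                r1(x) - lam + a * sum_y P1[x, y] V(y) ).   # active  (play), pay lam
    Returns V and the per-state action values (Q_passive, Q_active)."""
    V = np.zeros(len(r0))
    for _ in range(max_iter):
        Q0 = r0 + a * P0 @ V
        Q1 = r1 - lam + a * P1 @ V
        V_new = np.maximum(Q0, Q1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    Q0 = r0 + a * P0 @ V
    Q1 = r1 - lam + a * P1 @ V
    return V, Q0, Q1
```

Comparing the two action values $Q_0$ and $Q_1$ as $\lambda$ varies is exactly what indexability is about: the set of states in which resting is optimal should grow monotonically with $\lambda$.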

Indexability

![[img-20240927145248765|500]]
Indexability means that if it is optimal to rest a bandit under charge $\lambda$, then it is also optimal to rest it under any larger charge $\lambda' > \lambda$. 【——> Threshold structure】

WI policy

Whittle Index

Consider the Decoupled Problem and denote by $v_i(\xi_i)$ the Whittle index in state $\xi_i$. Given indexability, $v_i(\xi_i)$ is the infimum playing charge $\lambda$ that makes it equally desirable to play and to rest in state $\xi_i$.

![[img-20240927145701479|500]]
Taking into account the activation constraint of the Original Problem, we obtain the following WI policy:
【Whittle Index Policy】 At every decision time $t$, activate the $m$ bandits with the highest values of $v_i(\xi_i(t))$.
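Assuming indexability, the Whittle index can be computed with the same bisection idea used for the Gittins index, now comparing the active and passive action values of the decoupled problem. The sketch below reuses the hypothetical `decoupled_value` helper from above; the bisection bounds are assumptions that must bracket the index.

```python
def whittle_index(P0, P1, r0, r1, a, x, lo=-100.0, hi=100.0, iters=60):
    """Whittle index of state x (assuming indexability): the charge lam at which
    playing and resting are equally attractive in state x, found by bisection.
    lo and hi are assumed to bracket the index for the instance at hand."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        _, Q0, Q1 = decoupled_value(P0, P1, r0, r1, a, mid)
        if Q1[x] >= Q0[x]:
            lo = mid   # playing still (weakly) preferred -> index is larger
        else:
            hi = mid   # resting preferred                -> index is smaller
    return 0.5 * (lo + hi)

def whittle_policy(index_fn, states, m):
    """One decision step of the WI policy: activate the m bandits with the
    largest current indices. `index_fn(i, x)` is assumed to return v_i(x)."""
    ranked = sorted(range(len(states)), key=lambda i: index_fn(i, states[i]), reverse=True)
    return set(ranked[:m])   # indices of the bandits to play at this time
```

In a simulation loop, `whittle_policy` would be called at every decision time with the current states, the selected bandits stepped with their active dynamics, and the remaining ones with their passive dynamics.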


  1. J. Gittins, K. Glazebrook and R. Weber, *Multi-armed Bandit Allocation Indices*, 2nd ed., 2011. ↩︎

